Core: Add position and equality delta writer interfaces #3176

rdblue merged 2 commits into apache:master
Conversation
import org.apache.iceberg.deletes.PositionDelete;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

public class BasePositionDeltaWriter<T> implements PositionDeltaWriter<T> {
This is the writer to use for Spark merge-on-read.
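To make the intended merge-on-read flow concrete, here is a minimal, hypothetical sketch (the class and method names are invented; it only assumes the insert/delete/result signatures visible in this diff):

```java
import java.io.IOException;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.PositionDeltaWriter;
import org.apache.iceberg.io.WriteResult;

// Hypothetical sketch: replace one row during merge-on-read by deleting its old
// position and inserting the updated copy, then collect the produced files.
class MergeOnReadSketch<T> {
  WriteResult replaceRow(PositionDeltaWriter<T> writer, CharSequence dataFilePath, long pos,
                         T updatedRow, PartitionSpec spec, StructLike partition) throws IOException {
    writer.delete(dataFilePath, pos, spec, partition); // position delete for the old copy of the row
    writer.insert(updatedRow, spec, partition);        // write the updated row to a new data file
    writer.close();                                    // flush open files before reading the result
    return writer.result();
  }
}
```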
}

@Override
public WriteResult result() {
I am using the existing WriteResult that @openinx created. It has a builder and already takes care of converting values to arrays for serialization.
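For reference, a rough sketch of how a delta writer could assemble its result; the builder method names below are assumptions about the existing WriteResult class, not something added by this PR:

```java
import java.util.List;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.io.WriteResult;

// Sketch only: collect completed files into a WriteResult via its builder.
class ResultSketch {
  WriteResult toResult(List<DataFile> dataFiles, List<DeleteFile> deleteFiles) {
    return WriteResult.builder()
        .addDataFiles(dataFiles)      // files produced by the insert path
        .addDeleteFiles(deleteFiles)  // delete files produced by the delete path
        .build();
  }
}
```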
 * @param spec a partition spec
 * @param partition a partition or null if the spec is unpartitioned
 */
void deleteKey(T key, PartitionSpec spec, StructLike partition);
@rdblue, this is the directDelete you mentioned. I also added docs about the schema expectations.
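A hypothetical caller-side sketch of that direct delete path (the wrapper class is invented; it only assumes the deleteKey signature shown above):

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.EqualityDeltaWriter;

// Sketch of the direct delete path: the caller passes a key that contains only the
// configured equality fields (e.g. just an id column), so no copy of the old row is needed.
class DirectDeleteSketch<T> {
  void dropByKey(EqualityDeltaWriter<T> writer, T key, PartitionSpec spec, StructLike partition) {
    // the key's schema is expected to line up with the writer's equality field IDs
    writer.deleteKey(key, spec, partition);
  }
}
```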
 *
 * @param <T> the row type
 */
public interface EqualityDeltaWriter<T> extends Closeable {
This one will be implemented by the CDC writer that I will submit in a separate PR. It is large.
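Not part of this PR, but roughly the call pattern a CDC-style writer could follow on top of this interface (class and method names here are invented for illustration):

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.EqualityDeltaWriter;

// Hypothetical call pattern for a CDC-style stream; the actual CDC writer lands in a follow-up PR.
class CdcCallPattern<T> {
  void onInsert(EqualityDeltaWriter<T> writer, T row, PartitionSpec spec, StructLike partition) {
    writer.insert(row, spec, partition);
  }

  void onUpdate(EqualityDeltaWriter<T> writer, T oldKey, T newRow,
                PartitionSpec spec, StructLike partition) {
    // model an update as an equality delete of the old key followed by an insert of the new row
    writer.deleteKey(oldKey, spec, partition);
    writer.insert(newRow, spec, partition);
  }

  void onDelete(EqualityDeltaWriter<T> writer, T key, PartitionSpec spec, StructLike partition) {
    writer.deleteKey(key, spec, partition);
  }
}
```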
 * @param spec a partition spec
 * @param partition a partition or null if the spec is unpartitioned
 */
void delete(CharSequence path, long pos, T row, PartitionSpec spec, StructLike partition);
@rdblue, I kept the optional row before spec and partition. In most new APIs, spec and partition are the last arguments, so even though row is optional, keeping spec and partition last seems more consistent.
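To show the ordering side by side, a paraphrased fragment (an illustration only, not the PR file itself):

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.StructLike;

// Illustration of the argument order under discussion: the optional row sits in the
// middle while spec and partition stay last in both overloads.
interface PositionDeleteOrdering<T> {
  void delete(CharSequence path, long pos, PartitionSpec spec, StructLike partition);

  void delete(CharSequence path, long pos, T row, PartitionSpec spec, StructLike partition);
}
```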
/**
 * Deletes a key from the provided spec/partition.
 * <p>
 * This method assumes the delete key schema matches the equality field IDs.
I don't know what this means :)
I'll try to rephrase it then :)
void insert(T row, PartitionSpec spec, StructLike partition);

/**
 * Deletes a position in the provided spec/partition without persisting the old row.
I'm not sure I understand what this one means either. Is the old row the original row matching the position in this delete file? Why would I be persisting it?
The spec allows us to persist the deleted row in positional delete files. This may be helpful to reconstruct CDC records or to persist the sort key for min/max filtering.
That being said, I don't plan to persist it from Spark.
ah so "Delete a position and record the deleted row in the delete file" vs "Delete a position"
Yeah, that's a good way to put it. I'll update.
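Putting the two flavors next to each other, a hypothetical usage sketch (the wrapper class is invented; the calls follow the Javadoc above):

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.PositionDeltaWriter;

// "Delete a position" vs. "delete a position and record the deleted row in the delete file".
class PositionDeleteFlavors<T> {
  void deletePosition(PositionDeltaWriter<T> writer, CharSequence path, long pos,
                      PartitionSpec spec, StructLike partition) {
    writer.delete(path, pos, spec, partition);
  }

  void deletePositionAndKeepRow(PositionDeltaWriter<T> writer, CharSequence path, long pos,
                                T deletedRow, PartitionSpec spec, StructLike partition) {
    // keeping the row can help reconstruct CDC records or retain sort-key values for min/max filtering
    writer.delete(path, pos, deletedRow, spec, partition);
  }
}
```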
}

/**
 * Deletes a position in the provided spec/partition and persists the old row.
Same question as the last Javadoc.
RussellSpitzer left a comment

Other than the Javadoc comments, I think this is good to go. I would maybe add tests for an "all delete" and an "all insert" operation just to cover those edge cases, but the API looks good to me now that @aokolnychyi explained the purposes :)
Updated the Javadoc and also added tests for the delete-only and insert-only cases.
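For readers following along, a rough outline of what those two edge cases exercise; the harness is hypothetical and the WriteResult accessor names (dataFiles(), deleteFiles()) are assumptions, not quotes from the test code:

```java
import java.io.IOException;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.PositionDeltaWriter;
import org.apache.iceberg.io.WriteResult;
import org.junit.Assert;

// Outline of the two edge cases: a run that only deletes and a run that only inserts.
class DeltaWriterEdgeCaseOutline<T> {
  void checkDeleteOnly(PositionDeltaWriter<T> writer, CharSequence path,
                       PartitionSpec spec, StructLike partition) throws IOException {
    writer.delete(path, 0L, spec, partition);
    writer.close();
    WriteResult result = writer.result();
    Assert.assertEquals("Delete-only run should write no data files", 0, result.dataFiles().length);
  }

  void checkInsertOnly(PositionDeltaWriter<T> writer, T row,
                       PartitionSpec spec, StructLike partition) throws IOException {
    writer.insert(row, spec, partition);
    writer.close();
    WriteResult result = writer.result();
    Assert.assertEquals("Insert-only run should write no delete files", 0, result.deleteFiles().length);
  }
}
```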
Looks great. Thanks for getting these in, @aokolnychyi!

Thanks for reviewing, @rdblue @RussellSpitzer!
This PR adds position and equality delta writer interfaces and contains a subset of changes in PR #2945.